Thirty Days of Metal — Day 14: Perspective
This series of posts is my attempt to present the Metal graphics programming framework in small, bite-sized chunks for Swift app developers who haven’t done GPU programming before.
If you want to work through this series in order, start here.
In the previous article we entered the third dimension by introducing depth buffering, which allows us to implicitly sort surfaces by their distance.
So far, we’ve mostly avoided a detailed discussion of the different coordinate spaces we use when going from vertices to pixels. Today, we’ll take a closer look at exactly what our transformations are doing and introduce a new type of projection transformation.
Model Space
In previous articles, I have alluded to a mesh’s “local origin,” which implies that the mesh’s vertices are specified relative to some coordinate system. We call this coordinate system “model space” or “object space.”
Usually, the author of the 3D model decides which coordinate system to model in. For many objects, as well as legged creatures, it often makes sense to place the origin on an imagined ground plane, since many objects spend most of their time resting on flat surfaces. On the other hand, it may be more useful to select the object’s center of gravity as its local origin instead.
As for selecting the scale of an object, it is often useful to use real-world units like meters as the units of the model. This allows objects to be composed together in larger scenes without having to adjust their scales independently. For some applications that involve very large or very small distances, it may make sense to use different units, to avoid issues with numerical precision that can occur when transformations are composed.
Now that we know how vertices are defined in meshes and models, let’s talk about how we build virtual worlds by situating objects relative to one another.
World Space
As its name implies, world space is a global coordinate system relative to which objects can be arranged. We move between model space and world space by assigning each object a model-to-world transformation, colloquially called a model transformation (or world transformation), that moves points from model space into world space. We have seen model transformations in action when combining rotations, translations, and scaling in 2D.
As an example of a 3D model transformation, consider placing two spheres together in a scene. One sphere might be given a model transformation that translates it two units to the left, while the other is translated two units to the right.
In model space, the vertices of the two spheres are identical, but each vertex is transformed according to its sphere’s model transformation, which causes the vertices to have different coordinates in world space. In this same way, we can build a scene comprised of many different objects.
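As a sketch of what this might look like with the simd types we have been using, here are the two sphere placements. The `translate` initializer is a hypothetical convenience helper, not part of the simd framework:

```swift
import simd

// Hypothetical convenience initializer: a 4x4 matrix that translates by t.
extension simd_float4x4 {
    init(translate t: SIMD3<Float>) {
        self.init(SIMD4<Float>(1, 0, 0, 0),
                  SIMD4<Float>(0, 1, 0, 0),
                  SIMD4<Float>(0, 0, 1, 0),
                  SIMD4<Float>(t.x, t.y, t.z, 1))
    }
}

// Each sphere shares the same mesh but gets its own model transformation.
let leftSphereTransform  = simd_float4x4(translate: SIMD3<Float>(-2, 0, 0))
let rightSphereTransform = simd_float4x4(translate: SIMD3<Float>( 2, 0, 0))

// The same model-space vertex lands in two different world-space positions.
let vertex = SIMD4<Float>(0, 1, 0, 1) // top of a unit sphere, w = 1
let leftWorld  = leftSphereTransform * vertex  // (-2, 1, 0, 1)
let rightWorld = rightSphereTransform * vertex // ( 2, 1, 0, 1)
```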
View Space
Now we know how to situate objects relative to one another in a unified, global coordinate space: we give each object its own model transformation.
We have mentioned the notion of a “virtual camera” a couple of times without adequately defining it. When we speak of the camera in graphics, we mean the point of view from which we look at the scene, along with other parameters such as the field of view.
The position and orientation of the camera together form a coordinate space called view space, eye space, or camera space. Once the camera is positioned, its orientation can be set by selecting directions that point right (the x axis) and up (the y axis). By the right-hand rule, the z axis then points toward the viewer, not toward the world as you might expect. According to this convention, we view the world along the camera’s negative-z axis.
To find the view transformation, we take the inverse of the camera’s model-to-world transformation. To see why, consider how transformations move us between coordinate systems: if a transformation takes us from a coordinate system A to a coordinate system B, its inverse takes us from B back to A. Since we want to put the world in front of the camera rather than vice versa, we use the inverse as the view transformation.
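To make this concrete, here is a minimal sketch: a camera translated 5 units along +z (an arbitrary placement for illustration), whose inverse serves as the view transformation:

```swift
import simd

// A camera placed 5 units along the world's +z axis (hypothetical setup;
// any rigid camera-to-world transformation works the same way).
let cameraToWorld = simd_float4x4(SIMD4<Float>(1, 0, 0, 0),
                                  SIMD4<Float>(0, 1, 0, 0),
                                  SIMD4<Float>(0, 0, 1, 0),
                                  SIMD4<Float>(0, 0, 5, 1))

// The view transformation is the inverse of the camera's model-to-world matrix.
let viewMatrix = cameraToWorld.inverse

// A point at the world origin ends up 5 units in front of the camera,
// along its negative-z axis.
let p = viewMatrix * SIMD4<Float>(0, 0, 0, 1) // (0, 0, -5, 1)
```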
We refer to the region of space that is visible as the viewing volume. Any points not inside the viewing volume will not appear in our rendered image.
Once we introduce perspective, the world-space viewing volume is no longer a rectangular prism. Instead, it is a shape called a frustum, which is like a rectangular pyramid with its top cut off. The pyramid’s primary axis is oriented along the camera’s (negative) z axis, with the apex at the camera’s location. The view frustum is illustrated below.
That W Coordinate
In the same way that 3D points and vectors have 3 components (x, y, and z), 4D points and vectors have 4 components: x, y, z, and w. This last component has already been useful to us in unifying the different types of transformations into matrix form, but it has additional uses.
Recall that vectors have the same magnitude and direction regardless of where they are in a coordinate space. This implies that they transform differently from points. Specifically, applying a translation matrix to a vector should have no effect. We can achieve this by setting the vector’s w component to 0, which zeroes out the translational component of a transformation matrix.
On the other hand, we need points to be able to translate, so we set their w component to 1. This gives us the 3D translation results we want.
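We can see both behaviors with a single translation matrix. Multiplying a point (w = 1) picks up the translation stored in the matrix’s last column, while a vector (w = 0) ignores it:

```swift
import simd

// A matrix that translates by (3, 0, 0); the translation lives in the last column.
let translate = simd_float4x4(SIMD4<Float>(1, 0, 0, 0),
                              SIMD4<Float>(0, 1, 0, 0),
                              SIMD4<Float>(0, 0, 1, 0),
                              SIMD4<Float>(3, 0, 0, 1))

let point  = SIMD4<Float>(1, 2, 0, 1) // w = 1: translation applies
let vector = SIMD4<Float>(1, 2, 0, 0) // w = 0: translation is zeroed out

let movedPoint    = translate * point  // (4, 2, 0, 1)
let unmovedVector = translate * vector // (1, 2, 0, 0)
```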
What about other values of w? Is there ever a time when we want values between 0 and 1, or perhaps greater than 1 or less than zero?
Clip Space
When working in two dimensions, we frequently said that we were working in NDC (normalized device coordinates). This is only partially true. The true job of the vertex function is to produce positions in clip space.
So what is clip space? Clip space is a coordinate space in which the viewing volume is bounded between -w and w along the x and y axes, and between 0 and w along the z axis. What does it mean for a space to be bounded in this way?
As we will discuss below, the projection matrix calculates the w coordinate of the output position differently depending on the intended effect of the projection. This means that some projection matrices will produce w values that are not 0 or 1. As part of the vertex processing pipeline, vertices’ x, y, and z coordinates are compared to their w coordinate to determine whether they are inside the viewing volume.
The figure below illustrates clip space, with several points labeled.
This method of determining whether a vertex is “in” or “out” is what gives clip space its name: primitives are “clipped” against the boundary of the viewing volume on their way to rasterization.
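As a conceptual sketch of the “in” test (the GPU actually clips primitives against the volume’s boundary rather than testing points one at a time; the function name is my own):

```swift
import simd

// Returns true if a clip-space position lies inside the viewing volume,
// using Metal's conventions: -w <= x <= w, -w <= y <= w, and 0 <= z <= w.
func isInsideViewingVolume(_ p: SIMD4<Float>) -> Bool {
    return abs(p.x) <= p.w &&
           abs(p.y) <= p.w &&
           p.z >= 0 && p.z <= p.w
}

let inside  = isInsideViewingVolume(SIMD4<Float>(0.5, -0.5, 1, 2)) // true
let outside = isInsideViewingVolume(SIMD4<Float>(3, 0, 1, 2))      // false: x > w
```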
One quirk of clip space is that it is “left-handed”: the z axis points away from the viewer. This contrasts with our chosen convention for model space, world space, and view space, which are all right-handed. This implies that we need to flip the z axis when moving from view space to clip space. Indeed, we did this implicitly with the orthographic projection matrix we introduced on day 9. It is necessary to remember this difference when formulating other projection matrices.
The Perspective Divide
The job of the vertex function is to produce positions in clip space. In addition to scaling and translating points from their view-space positions into the appropriate ranges, the projection transform also does something else: it sets up the w coordinate for the perspective divide.
Just after the vertex function runs, before rasterization, the x, y, and z components of each vertex are divided by the w coordinate. Why would we do such a thing? The primary reason is to introduce perspective, hence this process is called the perspective divide.
Due to a phenomenon called convergence, the human visual system perceives distant objects as closer to the center of the field of vision. One example of this phenomenon is how train tracks appear to converge in the distance.
We achieve perspective in 3D rendering by deriving the w coordinate from the z coordinate as part of our projection matrix in the vertex shader. Then, when the perspective divide occurs, more distant z values — which are larger in magnitude — cause x and y to shrink proportionately, producing convergence and foreshortening (the illusion that parts of objects that are farther away are smaller).
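A small sketch makes the effect visible. Here, two hypothetical clip-space positions have the same x and y offsets, but the farther one has a larger w (copied from its negated view-space depth), so it lands closer to the center after the divide:

```swift
import simd

// Two clip-space positions with the same x and y offsets but different depths.
// The projection matrix has derived w from the (negated) view-space z.
let nearVertex = SIMD4<Float>(1, 1, 0.5, 2)  // w = 2: closer to the camera
let farVertex  = SIMD4<Float>(1, 1, 7.5, 10) // w = 10: farther away

// The perspective divide (performed by the GPU after the vertex function runs).
func perspectiveDivide(_ p: SIMD4<Float>) -> SIMD3<Float> {
    SIMD3<Float>(p.x / p.w, p.y / p.w, p.z / p.w)
}

let nearNDC = perspectiveDivide(nearVertex) // (0.5, 0.5, 0.25)
let farNDC  = perspectiveDivide(farVertex)  // (0.1, 0.1, 0.75): offsets shrink
```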
Our orthographic projection matrix never produced points whose w components were anything other than 1. This is because an orthographic projection is a parallel projection: parallel lines remain parallel after projection rather than converging. If we actually want perspective, we need a new kind of projection matrix.
Perspective Projection
There are numerous ways to parameterize a projection matrix. One very popular set of parameters is: near plane distance, far plane distance, aspect ratio, and field of view.
We have already used near and far z values to delineate our view-space viewing volume when performing orthographic projection. In the same way, when building a projection matrix, we select near and far z values to define the distance to the near plane and far plane. By convention these values are positive, but because the z axis of view space points toward the viewer, they actually represent distances along the negative z axis. This is accounted for in the projection matrix math.
If our rendered image is square, we do not need to account for the aspect ratio, since the image plane is also a square in normalized device coordinates. However, since we will often be displaying our rendered images in a non-square view, we need to adjust our projection matrix accordingly.
We also need to select a field of view angle to determine how much of the virtual world is visible. We can select either a horizontal field of view or vertical field of view, then compute the other based on the aspect ratio. Here, we will choose a vertical field of view, which is the angular measure between the top and bottom planes of the view frustum. Larger angles capture more of the scene, but using extremely large angles produces non-physical distortion. The goal should be to choose a sensible angle that accounts for the context of the virtual camera, the dimensions of the rendered image, and the expected viewing distance of the user.
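If you need the horizontal field of view that corresponds to a chosen vertical one, the relationship falls out of the frustum’s geometry: tan(hFoV / 2) = aspect × tan(vFoV / 2). Here is a small sketch (the function name is my own):

```swift
import Foundation

// Computes the horizontal field of view from a vertical field of view
// (both in radians) and an aspect ratio (width / height).
func horizontalFoV(verticalFoV: Float, aspectRatio: Float) -> Float {
    return 2 * atan(aspectRatio * tan(verticalFoV * 0.5))
}

let vFoV = Float.pi / 3 // 60 degrees
let hFoV = horizontalFoV(verticalFoV: vFoV, aspectRatio: 16.0 / 9.0)
// ~1.6 radians (about 91.5 degrees) for a 16:9 image
```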
We will not go through the complete derivation of perspective projection here. If you are curious about the theory, I recommend Song Ho Ahn’s excellent article on the subject. For now, here is a Swift function that generates a projection matrix given the parameters above:
init(perspectiveProjectionFoVY fovYRadians: Float,
     aspectRatio: Float,
     near: Float,
     far: Float)
{
    let sy = 1 / tan(fovYRadians * 0.5)
    let sx = sy / aspectRatio
    let zRange = far - near
    let sz = -far / zRange
    let tz = -far * near / zRange
    self.init(SIMD4<Float>(sx, 0, 0, 0),
              SIMD4<Float>(0, sy, 0, 0),
              SIMD4<Float>(0, 0, sz, -1),
              SIMD4<Float>(0, 0, tz, 0))
}

Note that sz and tz are chosen so that depths between the near and far planes map into Metal’s 0-to-w clip-space range (rather than the -w-to-w range used by OpenGL).

Replacing our existing projection matrix with a perspective projection matrix enables us to produce images that have more realistic proportions. Check out the sample code to see how to add support for multiple objects and introduce a view transformation that positions the virtual camera somewhere other than the origin.
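As a rough sketch of how such an initializer might be used per frame (the initializer is repeated here so the snippet compiles on its own; the parameter values are arbitrary):

```swift
import simd
import Foundation

// A perspective projection initializer like the one above, repeated so this
// sketch is self-contained. Depths map into Metal's 0-to-w clip range.
extension simd_float4x4 {
    init(perspectiveProjectionFoVY fovYRadians: Float,
         aspectRatio: Float,
         near: Float,
         far: Float)
    {
        let sy = 1 / tan(fovYRadians * 0.5)
        let sx = sy / aspectRatio
        let zRange = far - near
        let sz = -far / zRange
        let tz = -far * near / zRange
        self.init(SIMD4<Float>(sx, 0, 0, 0),
                  SIMD4<Float>(0, sy, 0, 0),
                  SIMD4<Float>(0, 0, sz, -1),
                  SIMD4<Float>(0, 0, tz, 0))
    }
}

// Hypothetical per-frame setup: 60-degree vertical field of view,
// a 16:9 drawable, and near/far planes at 0.1 and 100 units.
let projectionMatrix = simd_float4x4(perspectiveProjectionFoVY: .pi / 3,
                                     aspectRatio: 16.0 / 9.0,
                                     near: 0.1,
                                     far: 100)

// Model, view, and projection compose right to left:
// model space -> world space -> view space -> clip space.
let modelMatrix = matrix_identity_float4x4
let viewMatrix = matrix_identity_float4x4 // camera at the world origin
let mvpMatrix = projectionMatrix * viewMatrix * modelMatrix
// In the vertex function: out.position = mvpMatrix * float4(in.position, 1)
```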
We will discuss these changes and learn about how to create hierarchical arrangements of objects next time.